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A Direct Estimation Approach to Sparse Linear 
Discriminant Analysis* 

Tony Cai and Weidong Liu 

Abstract 



["tI ■ This paper considers sparse linear discriminant analysis of high-dimensional 

data. In contrast to the existing methods which are based on separate estimation 
of the precision matrix fl and the difference d of the mean vectors, we introduce a 
simple and effective classifier by estimating the product flS directly through con- 



K^ ' strained £i minimization. The estimator can be implemented efficiently using linear 



programming and the resulting classifier is called the linear programming discrimi- 
nant (LPD) rule. 

The LPD rule is shown to have desirable theoretical and numerical properties. 
It exploits the approximate sparsity of flS and as a consequence allows cases where 
S^ . it can still perform well even when ft and/or S cannot be estimated consistently. 

Asymptotic properties of the LPD rule are investigated and consistency and rate 
of convergence results are given. The LPD classifier has superior finite sample 
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performance and significant computational advantages over tlie existing methods 
that require separate estimation of fl and S. The LPD rule is also applied to 
analyze real datasets from lung cancer and leukemia studies. The classifier performs 
favorably in comparison to existing methods. 

Keywords: Classification, constrained /i-minimization, Fisher's rule, linear discriminant 
analysis, naive Bayes rule, sparsity. 



1 Introduction 

Classification is an important problem which has been well studied in the classical low- 
dimensional setting. In particular, linear discriminant analysis (LDA), which uses a linear 
combination of features as the criterion for classification, has been shown to perform well 
and enjoy certain optimality as the sample size tends to infinity while the dimension is 
fixed. Consider two |)-dimensional normal distributions N{fii, S) (class 1) and A^(/X2, S) 
(class 2) with the same covariance matrix. Let Z he a random vector that is drawn from 
one of these two distributions with equal prior probabilities. The goal of classification is 
to determine from which class Z is drawn. The problem is simple in the ideal setting 
where the parameters /x^, /Xg, and S are known in advance. In this case, Fisher's linear 
discriminant rule 

Mz) = i{iz-fi)'fid>o}, (1) 

where /x = (fii + /X2)/2, d = fii — fj,2 ^^^ ^ = ^~^) classifies Z into class 1 if and only 
if iPf{Z) = 1. This classifier is the Bayes rule with equal prior probabilities for the two 
classes and is thus optimal in such an ideal setting. 

Fisher's rule can be used to serve as an oracle benchmark, but it is typically not 
directly applicable in real data analysis as the parameters are usually unknown and need 
to be estimated from the samples. It is a standard practice to separately estimate fl and 
6 and then plug the estimates into ([T]) to construct a classifier. Let {Xk', 1 < A; < ni} 
and {Yk] 1 < k < n2} he independent and identically distributed random samples from 
N{fii, S) and N{fi2i ^) respectively. The classical estimates of ^i^, ^2 ^^'^ ^ ^^ the low- 
dimensional setting are the sample means X and Y and the inverse sample covariance 
matrix S„ . Plugging these estimates into ([I]) results in tpplZ), the empirical version of 
ipF^Z). Theoretical properties of ipplZ) has been well studied when p is fixed and can 



be found, for example, in Anderson (2003). 

With dramatic advances in technology, high-dimensional data are now routinely col- 
lected in a wide range of applications and classification for these data has drawn con- 
siderable recent attention. Examples include genomics, functional magnetic resonance 
imaging, risk management and web search problems. In the high-dimensional settings, 
the standard LDA performs poorly and can even fail completely. For example, Bickel 
and Levina (2004) showed that the LDA can be no better than random guessing when 
p/{ni + 7^2) — )■ C)0. In such a setting, the sample covariance matrix S„ is singular and its 
inverse is not well defined. One natural remedy is to use instead the generalized inverse 
of the sample covariance matrix. However, such an estimate is highly biased and unstable 
and will lead to a classifier with poor performance when p is large. A naive method in 
this case is to simply ignore the dependence among the variables and replace S with 
the diagonal of the sample covariance matrix. This leads to the so-called naive Bayes 
rule, also called the independence rule; see Bickel and Levina (2004). Assuming that the 
difference 5 is sparse. Fan and Fan (2008) proposed the features annealed independence 
rule which applies the naive independence rule to a set of selected important features of 5 
that are chosen by thresholding. This rule ignores the correlations between the variables 
and can be inefficient. See Section |6] for further discussions. 

In the high- dimensional setting, regularity conditions on O (or S) and 6 are needed 
to ensure that they can be estimated consistently. The most commonly used structural 
assumptions are that Q, (or S) and 5 are sparse. Under such assumptions, Q, and 6 
are estimated separately and are then plugged into the Fisher's rule ([T]). Assuming the 
covariance matrix S and the difference 5 are sparse, Shao, et al. (2011) used the thresh- 
olding procedures for estimating S and 5. More commonly in applications the sparsity 
assumption is on the precision matrix f2 instead of S. In such a setting, Rothman, et al. 



(2008) used the Glasso estimator for f2 in ([T]). Witten and Tibshirani (2009) proposed 
the scout procedure for classification in which they replaced ft with a shrunken estimate. 
See also Friedman (1989), Tibshirani, et al. (2002), Guo, et al. (2007), Wu, et al. (2009), 
and Hall, et al. (2009) and the reference therein. 

A simple but important observation is that the Fisher's discriminant rule ([1]) depends 
on f2 and d only through their product flS. In the present paper, we shall show that the 
product flS can be estimated directly and efficiently, even when Q, and/or S cannot be well 
estimated individually. We introduce the following direct estimation method for sparse 
linear discriminant analysis by estimating Q,6 through a constrained ii minimization 
method. Specifically, we propose to estimate f3* := fid by 

^ e argmin{|/3|i subject to |S„/3 - {X - Y)\oo < A„}, 

where A„ is a tuning parameter, and classify Z to class 1 if and only if 

{Z-fi)'^>0, 

where /t = {X + Y)/2. The estimator (3 can be implemented easily using linear pro- 
gramming. The resulting classification procedure is thus called the linear programming 
discriminant (LPD) rule. The LPD rule is data-driven and easy to implement. It has 
significant computational advantage over the existing methods that require separate es- 
timation of fi (or S) and (5, because it only requires the estimation of a p-dimensional 
vector via linear programming instead of the estimation of the inverse of a p x p covariance 
matrix. 

Both the theoretical and numerical properties of the LPD rule are studied in this 
paper. The LPD rule performs well when fl6 is approximately sparse, which is a weaker 
and more fiexible assumption than that both Q, and 6 are sparse. In particular, under 
this assumption the precision matrix f2 is not required to be sparse and may not be 



consistently estimable. The asymptotic properties of the LPD rule are investigated and 
consistency and rate of convergence results are given. In addition to the Gaussian case, 
extensions to the non-Gaussian distributions are also considered. Numerical performance 
of the LPD classifier is investigated using both simulated and real data. A simulation 
study is carried out and the numerical results show that the LPD rule has superior finite 
sample performance in comparison to several other classifiers. It significantly outperforms 
the alternative methods in terms of the average misclassification rate. The LPD rule is 
also applied to the analysis of real datasets from lung cancer and leukemia studies and 
performs favorably in comparison to existing methods. 

The rest of the paper is organized as follows. Section [2] introduces a constrained £i 
minimization method for the direct estimation of 0,6 which leads to the LPD classification 
rule. Section [3] investigates the asymptotic properties of the LPD rule in the Gaussian 
setting. Extensions to non-Gaussian distributions are given in Section HI Section [5] first 
discusses the linear programming implementation of the LPD classifier, and then investi- 
gates the numerical performance of the LPD rule by simulations and by applications to 
lung cancer and leukemia datasets. Discussions of our results and other related work are 
given in Section El The main results are proved in Section [71 

2 Classification via direct estimation of flS 

In this section we introduce a constrained ii minimization method for estimating the 
product nd directly. It will be shown in Sections [31 - [51 that the resulting classification 
rule enjoys desirable properties theoretically, computationally, and numerically. For ease 
of presentation, we shall focus on the Gaussian case in this section and Section [31 The 
non-Gaussian case is considered in Section [H We begin by reviewing basic notation and 
definitions. 
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For a vector f3 = (/3i, . . . , Pp)' G BP, define the 4 norm by |/3|o = Ei=i ^{/^i 7^ 0}; 
the ig norm by |/3|q = (XliLi lAI'^)^^'' for 1 < g < 00 with the usual modification for 
q = 00. The vector /3 is called /c-sparse if it has at most k nonzero entries. For a matrix 
n = {uJij)pxp, the matrix 1-norm is defined to be the maximum absolute column sum, 
||f2||/^^ = niaxi<j<p ^^^^ \uJij\- For a matrix ft, we say f2 is /c-sparse if each row/column 
has at most k nonzero entries. For two sequences of real numbers {a„} and {6„}, write 
o-n = 0{bn) for n > 1 if there exists a constant C such that |a„| < C|6„|, write a„ = o(6„) 
if lim„_j.oo dn/bn = 0, and write a^ x 6„ if there are positive constants c and C such that 
c < ttn/bn < C for all n > 1. 

Recall that {X^', 1 < A; < rii} and {1^^; 1 < /c < ^1-2} are independent and identically 
distributed random samples from N{fj,-^, S) and N{^2i ^) respectively. Set 

^ = — VX„ Y = —y^Y,, 5 = X-Y, fi={X + Y)/2. (2) 

Denote the sample covariance matrices by 

Sx = - Y.^X, - X)(X, - X)', ^Y = - Y.{Y. - Y){Y, - Y)', 

and set 



1=1 1=1 



77, 

where n = rii + n2. 

As mentioned in the introduction, most of the classification methods in the literature 
involve separate estimation of the unknown precision matrix f2 = I]~^ and the difference 
of the means d in the Fisher's rule ([1]). In the high-dimensional setting, the sample 
covariance matrix S„ is typically not invertible and regularity conditions are needed in 
order to be possible to construct good estimators. It should be noted that accurate 
estimation of a large sparse precision matrix is a difficult and computationally costly 
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problem itself. See, e.g., Ravikumar, et al. (2008), Yuan (2009), and Cai, Liu and Luo 
(2011). 

It is clear that the Fisher's rule ([1]) depends on fl and 6 only through their product 
nS. We now introduce a constrained £i minimization method to directly estimate the 
product n6 by exploiting the (approximate) sparsity of flS. We should note here that 
the sparsity of fid is a weaker and more flexible condition than the sparsity of both Q, and 
6. In particular, it does not require the precision matrix Q, to be sparse. See Remark [1] 
below for more discussions. Speciflcally, we propose to estimate f3* := flS by the solution 
to the following optimization problem: 

^Gargmin{|/3|i subject to |S„/3 - (X - r)U < A„}, (3) 

fieiRp 

where A^ is a tuning parameter which will be specifled later. The constrained ii min- 
imization method ([3]) is known to be an effective way for reconstructing sparse signals. 
The readers are referred to Donoho, et al. (2006) and Candes and Tao (2007) for more 
details on the ii minimization methods for sparse signal recovery. We shall show that the 
direct estimate leads to a classifler that is more effective and efficient than those based 
on estimating f2 and 6 separately. 

Given the solution /3 to ([3]), we propose the following classification rule: classify Z to 
class 1 if and only if 

(Z - A)'^ > 0. (4) 

The optimization problem ([3]) can be cast as a linear program. We shall call the discrim- 
inant in (jl]) the Linear Programming Discriminant (LPD) and the classification rule (jl]) 
the LPD rule. 

The motivation behind the constrained ii minimization method ([3]) for estimating 
/3* = fid directly can be easily seen as follows. Note that /3* is the solution to the 



equation S/3 — 6 = 0. When S and S are unknown, they are replaced by their respective 
sample versions S„ and 6 = X — Y. We then seek the most sparse solution within the 
feasible set 

{(3: |S„/3-(X-r)U<A4 

to account for the variability in S„ and 6. The convex relaxation of using ii minimization 
in place of io minimization is a standard technique in sparse signal recovery. We shall 
show in the next sections that the resulting classification rule (jl]) has desirable properties 
both asymptotically and numerically. The ii minimization method ([3]) works well when 
fid is approximately sparse. It thus allows the case where f2 itself is not sparse. In other 
words, it is possible to classify Z with accuracy using the classifier (jl]) even when f2 
cannot be estimated consistently. 

In addition to its good performance in terms of classification accuracy, the classifier 
given in (J4]) also enjoys significant computational advantages over existing methods that 
require separate estimation of fl and S. This can be seen at an intuitive level. There 
is only p parameters in nS, while one needs to estimate p'^/2 parameters if f2 and 6 
are estimated separately. More discussions on the computational issues will be given in 
Section [5l 

Remark 1 It is easy to see that if d is /ci-sparse and fl is /c2-sparse, then fid is at most 
^1^2-sparse. Furthermore, the sparsity of fid does not require f2 being sparse. Suppose 
S is /ci-sparse and without loss of generality assume the nonzeros are among the first ki 
coordinates. (In general we can always re-order the rows/columns of f2 accordingly.) So, 
d can be written as 

/ 




where Si is a fci-dimensional vector. Write f2 as 



i Z2I ^ ' 



21 i'22 



where flu is /ci x ki, r22i is (p — /i^i) x ki, and f222 is (p — ^1) x (p — ki). Then the sparsity 
of fid does not depend on the submatrix r222 at all. flS = (fj"^^) is sparse if r22i is 
sparse. In particular, if there are at most k2 nonzero elements on each column of $7215 
then flS is /ci(/c2 + 1) sparse. No condition on $722 is needed. In general, it is not possible 
to consistently estimate $7 under the spectral norm without regularity conditions on $722- 
The consistency of 17 was required by Shao, et al. (2011) through the invertibility of the 
estimated covariance matrix S and for the good asymptotic performance of the resulting 
classification rule. In fact, even when fl is the identity matrix, joint estimation of flS by 
(|3]) may lead to a better misclassification rate than estimating $7 and 6 separately as in 
Shao, et al. (2011). See Remark 5 for more details. 

Finally we note that there are also cases that neither fl nor 5 is sparse, but fid is. 
For example, if 5 = {ctu, . . . , api) , the first column of S, then flS = (1, 0, . . . , 0) . Hence 
the sparsity on flS is more flexible than assuming both i7 and 6 are sparse. 

3 Asymptotic properties 

We now turn to the theoretical properties of the LPD classifier given in (j4]). Both consis- 
tency and convergence rate results are given. We shall focus on the Gaussian case in this 
section. Extensions to the non-Gaussian case are discussed in Section H] and numerical 
performance of the classifier will be considered in Section [51 

The misclassification rate of the Fisher's rule (see, e.g., Anderson (2003)) is 

R:=l- <l>(Ay2) with Ap = 6'nd, (5) 
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which is the best possible performance in the ideal setting where all the parameters n^, /X2, 
and S are known in advance. This can serve as an oracle benchmark for the performance 
of any data-driven classifier based on the samples {X^} and {1^^}. 

It is not difficult to calculate that, given the samples {Xk} and {l^fc}, the conditional 
misclassification rate of the LPD rule is 

where f3 is given in ([3]). The performance of the LPD rule can then be naturally measured 
by the difference (or ratio) between /2„ and the Bayes misclassification rate R. In this 
section we will study the difference and ratio between i?„ and R. To this end, we need to 
introduce some conditions. 

(CI), rii X 722, logp < n, Cq^ < Amin(S) < Amax(S) < cq for some constant cq > and 
Ap > ci for some Ci > 0. 

Here we assume that the two samples are of comparable sizes and the eigenvalues of 
the covariance matrix S are bounded from below and above. These are commonly used 
conditions in the high dimensional setting. In addition, we also assume Ap is bounded 
away from zero. If Ap — )■ 0, then it can be seen easily from ([5]) that even the oracle rule 
is no better than random guessing. 

Our first result is on the consistency of Rn- 

Theorem 1 Let \n = C^jAplogp/n with C > being a sufficiently large constant. 
Suppose (CI) holds and 



Then we have as n -^ 00 and p — t- 00, 

Rn-R^O (7) 

11 



in probability. 

This theorem shows that the LPD rule is consistent when flS is sparse. In practice, the 
value of the tuning parameter A^ is chosen by cross-validation. See Section |5] for further 
discussions on the implementation of the LPD rule. 

Remark 2 As mentioned earlier, the condition \flS\Q = o( ^^/n/\ogp] does not require O 
to be sparse. Therefore, by estimating flS directly, we do not need a consistent estimate 
for f2 or S under the spectral norm to get the asymptotically optimal misclassification 
rate. In contrast, consistent estimation of fl is required by Shao, et al. (2011). A basic 
condition in Shao, et al. (2011) is that S =: {(Jij)pxp is (approximately) sparse with the 
sparsity So{p) of S satisfying So(p)(logp/n)(^"«)/2 = o(l) and maxi<j<p Y7j=i I'^y 1^ < ^o{p) 
for < g < 1. It follows from Cai and Zhou (2010) on the minimax rate of convergence 
for estimating sparse covariance matrices, this condition is necessary for the consistency 
under the spectral norm. 

Theorem [1] can be extended to a more general setting where Q5 is only approximately 
sparse. To state this result, we first relax the condition (CI) as follows. 

(C2). ni X 7^2, \ogp < n, maxi<j<p ctjj < K for some constant K > and Ap > Ci for 
some constant ci > 0. 

Theorem 2 Let A„ = C^yAp\ogp/n with C being a sufficiently large constant. Suppose 
(C2) holds and 

+ ^ = «U/T7rr • (8) 



Then we have as n —)■ oo and p — )■ oo, 

Rn-R^O (9) 

in probability. 
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Remark 3 It follows from the Cauchy-Schwarz inequality and (CI), 

\fid\l < \ns\o\fid\l < cl\nd\o\s\l 

and Ap > C(^^|<5|2. Thus ([6]) implies ([8]). The condition (|6]) can be further relaxed if 
the minimum magnitude of the nonzero elements of fl6 is relatively large. Let S = 
{i : {flS)i ^ 0}. If minjg5 |(f25)j| > C(logp/n)^/^, then a sufficient condition of (^ is 
\flS\o = o{n/\ogp). Condition ([8]) allows the case where QS is only approximately sparse 
with many small entries. 

Remark 4 The condition maxi<j<pO"jj < K can be relaxed. Let Kp := maxi<j<p ctjj and 



^n = C y Kp/S.p\ogp / n. Theorem [2] still holds under the condition 

\n8\i \n8\l { I n~ 



A^^ A2 \\lKp\ogp 

Here Kp can grow and may tend to infinity as p — )■ oo. 

Theorems [1] and [2] provide the consistency results for the LPD rule. Consistency is 
important, but the fact i?,„ — i? — )■ does not give a detailed description of the properties 
of a classifier. For example, when the Bayes misclassification rate i? — )■ 0, any classifier 
with i?„ — )■ is consistent. Stronger results on the rate of convergence can be obtained. 



Theorem 3 Let A„ = C^J Ap\ogp/n with C being a sufficiently large constant. Suppose 
(C2) holds and 

Then 



with probability greater than 1 — 0{p~^). In particular, if (CI) holds and 



|r2(5|nA„ = O 



lO^p 



logp 
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then 



''--i = oimoA,J'''^p 



R \' '" ^V n 

with probability greater than 1 — 0{p~^). 

Theorem [3] shows that a larger Ap imphes a worse convergence rate for the relative 
classification error Rn/R- This is in fact to be expected. When Ap is large, the classifi- 
cation problem is easy and the Bayes misclassification rate R can be very small. It then 
becomes harder for any data-driven classification rule to mimic the performance of the 
oracle rule. 

Remark 5 Due to the differences in setting, it is not directly comparable between our 
results and the results in Shao, et al. (2011). To make them comparable, it is necessary 
to assume both S and fl are sparse. For simplicity, we consider the case S = Ipxp- 
Suppose that \6\i < K for some constant K and logp = o{n). Theorem [3] shows that 
Rn/R — 1 = Op{^J\ogp/n). Let -R* be the conditional misclassification rate of the SLDA 
rule proposed in Shao, et al. (2011). Their results show that Rn/R — 1 = Op(6„) with 
bn = {\ogp/n)°'^^~^^S\2g for some < a < 1/2 and < g < 1. It is easy to see that 



^n/v logp/n — i- oo. So the LPD rule outperforms the SLDA rule in this case. 

The convergence rate in Theorem [3] can be further improved under stronger conditions. 

Theorem 4 Let A„, = Cy^Aplogp/n with C being a sufficiently large constant. Suppose 
(C2) holds and 

||f7|UJf25|o + |f25|iAf = o(y^). (10) 

Then 



R I' ' ^ V n 
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with probability greater than 1 — 0{p ^). In particular, if (CI) holds and 

\\n\\LAns\o + \ns\'J^Ap = o(^^ 



\ogp 
then 



"^"-1 = ifi^iyx /^"^^ 



R ' " \' — '° ^V n 
with probability greater than 1 — 0{p~^). 

4 Extensions 

Section [3] establishes the theoretical properties of the LPD classifier in the Gaussian 
setting. The results can be extended to a class of non-Gaussian distributions satisfying 
certain moment conditions. 

Let X and Y be two p-dimensional random vectors satisfying 



X = fi^ + Ui and Y = fi^ + U 



2, 



where Ui and U2 are independent and identically distributed random vectors with mean 
zero and covariance matrix S = {(yij)pxp- Fang and Anderson (1990) showed that the 
Fisher's rule is still optimal when JJi has an elliptical distribution with zero mean and 
density 

c,|sr^/v(w's-^u), (11) 

where / is a monotone function on [0, 00) and Cp is a normalizing constant. The optimal 
misclassification rate in this case is 

R = -p(u[ns < -d'ns) + -phj'^ns > s'nsY 
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As in Shao, et al. (2011), we relax the normality of C/i to that, for any p dimensional 
non-random vector / with |/|2 = 1 and any t E R, 



p{l!Q}'^Ui<t\ =:*(t) 



is a continuous distribution function symmetric about and does not depend on I. The 
elliptical distributions (such as fITT]) ) and the multivariate scale mixture of normals satisfy 
this condition. The conditional classification error of the LPD rule (jl]) given {X^} and 
{Yk] is 

2 ^ (^ S^)V2 ^ 2 V 5.^)1/2 / 
To obtain the convergence rate for i?„, we shall impose an additional condition: for any 

a; < and \5\ < 1, 

^{x + 5) 



<ci|(5|(|x| + l)e"^l"^' (12) 



^(x) 

for some positive constants ci,C2 which do not depend on x and 5. Note that the distri- 
bution with density function p{x) = 03(1 + |x|)~'^e~'^*'^''^ satisfies ( TT2l) . where C3 and C4 
are positive constants, (f and w are constants with < (p < 2, w > 0, or if = 0, w > 1. 

The moment conditions are divided into two cases: the sub-Gaussian-type tails and 
the polynomial-type tails. Let C/i =: (f/i, . . . , Up)' and Us = U]Q,5/ /S.p . Note that Us 
is a standardized random variable with zero mean and unit variance. 
(C3). (Sub-Gaussian-type tails) Suppose that logp < n and there exist some con- 
stants ?7 > and A'l > such that 

£e^^(r]Uf\ <Ki, and Eexp (r/C/^/dii) < i^i for alH. (13) 

(C4). (Polynomial-type tails) Suppose that for some 7, Ci > 0, p < Ci?7,^, and for 
some e > 

E|[/5|^^+^+^ < i^i and E|f/i/aV'|^^+^+^ < i^i for alH. (14) 
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Theorem 5 (i). Assume that the conditions in Theorem\^hold. By replacing the nor- 
mality with elliptical distributions satisfying (C3) or (C4), we have as ra —)■ oo andp — )■ oo, 

-R„ — i? — !■ in probability. (15) 

(a). Under the conditions in Theorem\^ [W^) and (C3) (or (C4)), 



with probability greater than 1 — 0{p~^ + n~^^^). 

Similarly, Theorem H] remains valid if the normality assumption is replaced by elliptical 
distributions satisfying (C3) or (C4). For reasons of space, we do not restate the results 
here. 

5 Numerical Investigation 

We now turn to the numerical performance of the LPD rule using both simulated and real 
data. We begin in Section 15.11 with a discussion on the implementation of the classifier 
using linear programming and the selection of the tuning parameter A„ through cross- 
validation. Section 15.21 presents simulation results and comparisons with other methods 
including oracle features annealed independence rule (OFAIR) (the support of d is as- 
sumed to be known), the nearest shrunken centroids method (NSC) proposed by Tibshi- 
rani, et al. (2002), the sparse linear discriminant (SLD) introduced in Shao, et al. (2011), 
the Naive-Bayes rule (Naive-LDA), the LDA rule with a generalized inverse (GLDA) as 
well as the oracle Fisher's rule (Oracle). The applications of the LPD rule to the analysis 
of a lung cancer dataset and a leukemia dataset are given in Section 15.31 
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5.1 Implementation of LPD 

Recall that the estimate of /3* = fid is obtained by solving the constrained ii minimization 
problem 

^ e argmin{|/3|i subject to |S„/3 - (X -Y)\oo< Xn}- 

This optimization problem is convex, and can easily be recast as the following linear 

program, 

p 
min y^ Uj 

j=i 

subject to: — /3j < Uj for all 1 < j < p 
+ Pj < Uj for all 1 < j < p 
- &'^(3 + 6k < Xn for al\l <k <p 
+ a-f.j3 — Sk < Xn for all 1 < A; < p, 
where (Si, . . . ,6p) ■.= 6 and (<Ti, . . . , a-p) := S„. 

This linear programming implementation is similar to that of the Dantzig selector in 
high-dimensional linear regression. See Candes and Tao (2007). We then apply the primal- 
dual interior-point method to solve ( IT6l) . See, for example, Boyd and Vandenberghe (2004) 
for more details on the primal-dual interior-point method. We should note that there are 
other stable algorithms based on first-order method that may be used to implement the 
optimization problem (E]); see Becker, Candes and Grant (2010). Similar to many iterative 
methods, one needs to specify a feasible initialization. To this end, we replace Xl„ in ([3]) 
by Sp = S„ + pipxp with a small positive number p (e.g. p = y\ogp/n) and take the 
initial value to be S 5. Such a perturbation does not noticeably affect the computational 
accuracy of the final solution in our numerical experiments. All the theoretical properties 



in Sections 3 and 4 still hold for p < ^J\ogp/n with the additional condition Ajnax(S) < K 
for some constant K > Q. 



The computational cost of estimating 0,6 directly through linear programming as 
described above is much smaller than that of estimating the precision matrix fl. For 
example, if one estimates Q using the method in Yuan (2009) or the CLIME method in 
Cai, Liu and Luo (2011), the computation cost is p times to that of estimating f2<5 directly 
bydSD. 

There is a tuning parameter A = A„ in the algorithm. As mentioned before, A can 
be chosen empirically by cross validation (CV). This can be done as follows. Divide the 
sets {1, 2, . . . , rii} and {1,2,..., -0.2} into 2N subgroups Hn, . . . , Hu^ and H21, . . . , H2n- 
Thus the samples {Xi, Yj] 1 < i < ni,l < j < 77-2} are divided into Xk := {Xi, Yj] i G 
Hik.i G H-2k}, 1 < k < N. Let /i^^-), d(^k) and S(fc) be defined as in (|2]), based on 
{Xi, Yj] I < i < ni,l < j < ^2} \ Xk- For any given choice of A, calculate j3/j^^{X) based 
on 5(fc) and S(fc) by ©. Let /]? = 1 if (X,- -/X(fc))';9(fc)(A) > for Xj e X^; else jjf^ = 0. 
Similarly, define jjs^ = 1 if {Yj - /i(fe))'^(^,)(A) < for Yj G Xk] else /jj^ = 0. Then 

N 



cvw = E ( E 4? + E 'i?) 



k=l idH^k J&H2k 

is the total number of correctly classified cases among the validation sets for the classifier 
with a given choice of A. The final choice of A is A = maxACV"(A). If the maximum is 
attained at several A's, the minimum value of these A's is selected. 

5.2 Simulation results 

We now present simulation results and compare the numerical performance of the LPD 
classifier with the oracle features annealed independence rule (OFAIR) (Fan and Fan 
(2008)) where the support of the difference d is assumed to be known, the nearest shrunken 
centroids method (NSC) (Tibshirani, et al. (2002)), the sparse linear discriminant (SLD) 
(Shao, et al. (2011)), the Naive-Bayes rule (Naive-LDA), the LDA rule with a generalized 
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inverse (GLDA) and the oracle Fisher's rule (Oracle). The oracle rule is included as a 
benchmark. 

The setup in the simulation study is as follows. We fix the sample sizes 71-1 = ^2 = 200 
and set ^^ = and /Xg = (1, . . . , 1, 0, . . . , 0) , where the number of I's is sq = 10. Three 
models are considered. 

• Model 1. f2 = {crij)~^p with (Ju = 1 ioi 1 < i < p and o"jj = p with p = 0.5 for i 7^ j. 

• Model 2. n = (B + SI)/(1 + S), where B = {bij)pxp with independent % = bji = 
0.5 X Ber(l, 0.2) ior 1 < i < Sq, i < j < p; bij = bji = 0.5 ioi Sq + I < i < j < p; bu = 
1 for 1 < i < p. Here Ber(l, 0.2) is a Bernoulli random variable which takes value 1 
with probability 0.2 and with probability 0.8; and 6 = max(— Amm(-B), 0) +0.05 to 
ensure that f2 is positive definite. Finally, the matrix is standardized to have unit 
diagonals. 

• Model 3. 17 = 5]"\ where S = ((T.y)pxp with a^ = O.S'*"-'! for 1 < i,j < p. 

n in Model 1 is an approximately sparse matrix. It is diagonally dominant with the 
off-diagonal entries of order p~^. In Model 1 f2<5 is also approximately sparse. In Model 
2, only the first Sq rows and columns of f2 are sparse and the rest of the matrix is not 
sparse. In Model 3, S can be well approximated by a sparse matrix and the inverse ft is 
a 3-sparse matrix. Model 3 satisfies the conditions in both Shao, et al. (2011) and the 
present paper. This enables a fair comparison between the SLD in Shao, et al. (2011) 
and the LPD rule. 

In the simulation, we generate rii = n2 = 200 training and test samples of the same size 
according to Models 1-3 with the multivariate normal distribution and the multivariate t 
distribution with five degrees of freedom. The tuning parameter A„ is chosen by five-fold 
cross validation as described in Section 5.1. Note that the covariance matrix S in Models 
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1 and 2 are not sparse. So the thresholding estimator for S used in Shao, et al. (2011) may 
be not invertible. The generahzed inverse of the thresholding estimator is used when the 
estimator itself is not invertible. The SLD rule in Shao, et al. (2011) requires to choose 
two tuning parameters by cross validation. To reduce the computational cost, when 
implementing the SLD rule we assume the support of 5 is known so that only the tuning 
parameter for estimating the covariance matrix is needed. The average classification 
errors for the test samples and the standard deviations based 100 replications are stated 
in Tables [T] and [2J 

Table [1] displays the numerical results of the six classifiers as well as the oracle Fisher's 
rule in the Gaussian case. For Model 1, the performance of the LPD rule is similar to 
that of the oracle Fisher's rule, and is better by a large margin than those of the other 
five classifiers OFAIR, NSC, SLD, Naive-LDA and GLDA. Comparing to these methods, 
LPD has the smallest classification errors with the smallest standard deviations. The 
classification error is also quite stable as p increases from 100 to 800. The performance 
of SLD is not stable in Model 1 because S is not sparse and the generalized inverse of 
the thresholding estimator is used. For Models 2 and 3, the LPD rule again significantly 
outperforms the other five classifiers. The misclassification rate of the LPD rule in Model 

2 is less than half of those of the other five methods. 

Table [2] shows the corresponding numerical results in the case of the multivariate t^ 
distribution. In comparison to the results for the Gaussian case given in Table [Tj it can 
be seen from Table [2] that the classification errors of all methods including the oracle rule 
increase when the tail of the distribution becomes heavier. In this case the performance 
of the LPD rule remains close to that of the oracle rule in Model 1 and the LPD classifier 
again significantly outperforms OFAIR, NSC, SLD, Naive-LDA and GLDA in all of the 
three models. 
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p 


LPD 


OFAIR 


NSC 


SLD 


Naive-LDA 


GLDA 


Oracle 








1 


VIodel 1 








100 


2.42(0.78) 


25.07(2.01) 


18.58(8.27) 


3.20(0.89) 


21.39(11.53) 


3.54(1.00) 


1.60(0.07) 


200 


2.45(0.75) 


24.80(1.85) 


17.70(9.18) 


6.23(1.35) 


25.51(11.94) 


7.28(1.53) 


1.51(0.60) 


400 


2.27(0.83) 


24.28(2.28) 


19.35(8.35) 


41.45(4.32) 


32.12(12.16) 


41.95(4.18) 


1.41(0.57) 


800 


2.51(1.08) 


24.51(2.03) 


19.40(8.31) 

1 


13.28(1.97) 
VTodel 2 


39.57(9.43) 


17.24(2.20) 


1.30(0.61) 


100 


3.23(0.99) 


13.88(3.10) 


13.38(4.90) 


10.73(5.15) 


19.08(6.57) 


3.53(0.98) 


1.62(0.64) 


200 


5.12(1.24) 


25.75(4.07) 


26.13(5.88) 


18.92(7.61) 


36.15(5.05) 


8.21(1.37) 


1.83(0.64) 


400 


8.18(1.59) 


18.04(3.24) 


21.01(5.61) 


20.87(8.32) 


37.85(4.70) 


43.90(3.70) 


2.64(0.78) 


800 


14.87(2.27) 


23.52(2.69) 


30.40(4.49) 

1 


26.48(4.82) 
VTodel 3 


45.59(3.38) 


32.12(2.92) 


3.12(0.80) 


100 


18.93(2.08) 


24.92(2.00) 


25.06(2.04) 


25.09(2.61) 


25.65(2.22) 


22.63(2.25) 


16.55(1.74) 


200 


19.42(2.14) 


24.81(1.95) 


25.02(2.07) 


25.40(4.96) 


26.23(2.18) 


29.31(2.42) 


16.47(1.94) 


400 


19.64(2.47) 


24.50(2.31) 


24.73(2.47) 


24.60(2.49) 


27.56(2.39) 


47.46(3.21) 


16.44(2.21) 


800 


19.90(2.34) 


24.94(2.26) 


25.24(2.43) 


25.37(3.25) 


29.70(2.36) 


34.41(2.61) 


16.61(2.04) 



Table 1: Average classification error for the test samples in percentage in the normal 
distribution case. Standard deviations are given in parentheses. 
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p 


LPD 


OFAIR 


NSC 


SLD 


Naive-LDA 


GLDA 


Oracle 










Model 1 








100 


6.70(1.20) 


30.12(2.26) 


23.84(8.33) 


7.79(1.36) 


27.22(10.82) 


8.50(1.42) 


5.02(1.17) 


200 


6.64(1.28) 


30.12(2.07) 


24.98(8.26) 


11.76(1.80) 


31.23(12.37) 


13.59(2.07) 


4.61(1.06) 


400 


6.29(1.38) 


29.90(2.47) 


25.40(8.29) 


43.05(4.95) 


39.37(8.85) 


44.19(3.71) 


4.38(0.85) 


800 


5.86(1.09) 


30.02(2.21) 


26.60(7.81) 


19.73(2.40) 
Model 2 


41.83(10.24) 


25.25(2.56) 


4.06(1.09) 


100 


8.02(1.35) 


19.90(3.51) 


20.60(6.04) 


17.25(8.80) 


27.34(6.85) 


8.46(1.47) 


4.91(0.92) 


200 


11.06(1.89) 


30.40(4.15) 


31.67(5.30) 


42.73(13.97) 


40.80(4.08) 


14.57(2.09) 


5.15(1.04) 


400 


15.15(1.96) 


25.10(3.80) 


30.13(5.71) 


39.12(8.33) 


42.63(3.63) 


45.20(3.40) 


6.33(1.28) 


800 


23.19(2.12) 


30.60(3.73) 


36.40(4.08) 


31.17(4.08) 
Model 3 


47.25(2.55) 


38.00(2.75) 


7.28(1.33) 


100 


24.04(2.34) 


29.30(2.13) 


29.55(2.24) 


29.80(3.19) 


30.52(2.23) 


28.81(2.55) 


21.46(1.95) 


200 


25.01(2.18) 


29.23(2.02) 


29.39(2.12) 


29.74(4.00) 


31.80(2.20) 


34.88(2.49) 


21.76(1.98) 


400 


25.61(3.08) 


29.27(2.29) 


29.57(2.30) 


29.79(4.41) 


33.26(2.77) 


48.00(2.79) 


21.70(2.25) 


800 


25.92(2.59) 


28.88(2.12) 


29.08(2.27) 


29.11(2.09) 


35.50(2.51) 


39.43(2.61) 


21.60(2.52) 



Table 2: Average classification error for the test samples in percentage in the t^ distribu- 
tion case. Standard deviations are given in parentheses. 
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Support recovery by /3 is also considered in the simulation. We only consider Model 
3, in which fid has 11 nonzero elements. Note that in Model 1 all of the elements of fl5 
are nonzero and most of the elements of fid (more than 88%) in Model 2 generated by 
our simulation are nonzero. We thus do not consider support recovery for Models 1 and 
2. In Model 3, the number of nonzero elements (POS) and true nonzero elements (TPOS) 
in /3 are calculated. The ability to recover the support is evaluated via the true positive 
rate (TPR) in combination with the false positive rate (FPR), defined respectively as 

^ #{2 : A ^ and {^6)^ ^ 0} ^ #{z : ^ t^ and {^6\ = 0} 

#{z : (f2(5), ^ 0} '''' ^{^■.{nS)i = 0} 

The simulation results are summarized in Table [3l We can see that our method leads to 
a sparse solution (3. It correctly recovers more than 8 nonzero elements in the normal 
distribution case and more than 7 nonzero elements in the t^ distribution case in average. 
Note that FPR is low, and thus most of zero positions can be recovered by f3. 

Finally, we carry out a simulation study to investigate the accuracy between the tuning 
parameter A chosen by CV and the optimal value Xopt which minimizes the misclassifi- 
cation rate for the test samples. The results are stated in Table H] for Models 1-3 with 
the multivariate normal distribution. It can be seen that the value A chosen by CV 
and the optimal choice Xopt are quite close. Additional simulation results show that the 
performance of the LPD rule using A is similar to that using the optimal choice Xopt- 

5.3 Real data analysis 

In addition to the simulation results presented above, we also apply the LPD classifier to 
the analysis of two real datasets, one from a lung cancer study (Gordon, et al. (2002)) and 
another from a leukemia study (Golub, et al. (1999)) to further examine the performance 



of the LPD rule. The lung cancer dataset is available at |http : / /www. chestsurg . org and the 



leukemia dataset is available at http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi 

24 



p 


100 


200 


400 


800 




Normal distribution 




POS 


21.92(9.75) 


26.39(17.88) 


23.06(14.14) 


25.67(16.96) 


TPOS 


8.47(0.36) 


8.14(0.36) 


8.25(0.36) 


8.36(0.36) 


TPR 


0.77(0.11) 


0.74(0.11) 


0.75(0.11) 


0.76(0.11) 


FPR 


0.15(0.11) 


0.10(0.10) 


0.04(0.04) 


0.02(0.02) 






is distribution 




POS 


18.69(10.17) 


23.97(17.65) 


22.98(16.96) 


21.48(14.90) 


TPOS 


7.15(0.33) 


7.15(0.40) 


7.15(0.36) 


7.05(0.43) 


TPR 


0.65(0.10) 


0.65(0.12) 


0.65(0.11) 


0.64(0.13) 


FPR 


0.13(0.11) 


0.09(0.09) 


0.04(0.04) 


0.02(0.02) 



Table 3: Support recovery of Q,6 for Model 3. Standard deviations are given in parenthe- 
ses. 



Table 4: Average of A and Xopt (SD). 



p 


^ ^opt 


A 


A opt 


A Xopt 




Model 1 


Model 2 


Model 3 


100 


0.13(0.05) 0.12(0.05) 


0.14(0.09) 


0.11(0.07) 


0.18(0.02) 0.14(0.06) 


200 


0.15(0.05) 0.14(0.04) 


0.13(0.04) 


0.11(0.04) 


0.19(0.02) 0.17(0.06) 


400 


0.20(0.05) 0.18(0.05) 


0.19(0.06) 


0.18(0.08) 


0.24(0.05) 0.21(0.05) 


800 


0.23(0.05) 0.20(0.04) 


0.17(0.06) 


0.15(0.03) 


0.26(0.05) 0.24(0.04) 
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5.3.1 Lung cancer data 

The lung cancer dataset in Gordon, et al. (2002) consists of 181 tissue samples and 
each sample is described by 12533 genes. Among the 181 tissue samples, there are two 
classes of tissue samples including 31 malignant pleural mesothelioma (MPM) and 150 
adenocarcinoma (ADCA). Distinguishing MPM from ADCA is important and challenging 
from both clinical and pathological perspectives. This dataset has been analyzed in Fan 
and Fan (2008) using FAIR and NSC. In this section we apply the LPD rule to this dataset 
for disease classification. 

The sample variances of the genes range over a wide interval. After rescaled by a factor 
of 10^, there are 165 genes with the sample variances larger than 10^ and 41 genes with the 
sample variances smaller than 10"^. See Figure[T]for a plot of the sorted sample variances. 
To ensure numerical stability, we drop these 206 genes to control the condition number of 




2000 4000 6000 8000 10000 12000 14000 




1000 2000 3000 4000 5000 6000 7000 8000 



(a) Lung cancer data 



(b) Leukemia data 



Figure 1 : The diagonal of the sample covariance matrix 

Sp so that the numerical solution of S is accuracy. We use 32 training samples with 16 
from MPM and 16 from ADCA. The rest 149 samples with 15 from MPM and 134 from 
ADCA are used for testing. To reduce the computational costs, only 3000 genes with the 
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largest absolute values of the two sample t statistics are used. The classification result is 
satisfactory, although only 3000 genes are used. Two-fold cross validation method is used 
for choosing the tuning parameter A„. The resulting estimate of fid contains 369 nonzero 
elements, which is about 12.3% of all elements. Classification results are summarized in 
Table [51 The LPD rule classifies all of 149 testing samples correctly. In contrast, the 
navie-Bayes rule misclassifies 12 of 149 testing samples and GLDA misclassifies 7 of 149 
testing samples. Fan and Fan (2008) report a test error rate of 7/149 for FAIR and a test 
error rate of 11/149 for NSC proposed by Tibshirani, et al. (2002). 

Table 5: Classification error of Lung cancer data by various methods. 





LPAD 


FAIR 


NSC 


Naive-LDA GLDA 


Training error 
Testing error 


0/32 
0/149 


0/32 
7/149 


0/32 
11/149 


0/32 0/32 
12/149 7/149 



5.3.2 Leukemia data 

The leukemia dataset in Golub, et al. (1999) consists of 72 tissue samples, which were 
all from acute leukemia patients, either acute lymphoblastic leukemia (ALL) or acute 
myelogenous leukemia (AML). Each sample is described by 7129 genes. Distinguishing 
ALL from AML is critical for a successful treatment. The dataset has been analyzed by 
Fan and Fan (2008). In this section, we apply the LPD rule to this dataset and compare 
the classification results with those obtained in Fan and Fan (2008) using FAIR and NSC. 
As in the analysis of the lung cancer data, we first drop 129 genes with extreme sample 
variances, either larger than 10^ or smaller than 10~^ after rescaled by a factor of 10^. 
See Figure [1] Among the 72 tissue samples, there are 38 training samples (27 in class 
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ALL and 11 in class AML) and 34 test samples (20 in class ALL and 14 in class AML). 
Similarly to the analysis of the lung cancer data, to control the computational costs, we 
only use 3000 genes with the largest absolute values of the two sample t statistics. The 
estimate of fid contains 206 nonzero elements, which is about 6.87% of all elements. The 
classification results are summarized in Table [61 The LPD rule only misclassifies 1 of 
the 34 testing samples and makes training error. In comparison, the navie-Bayes rule 
misclassifies 7 of 34 testing samples and 1 of 38 training samples. GLDA misclassifies 3 of 
34 testing samples and 1 of 38 training samples. From Fan and Fan (2008), FAIR makes 
1/34 test error rate and 1/38 training error rate, and NSC makes 3/34 test error rate and 
1/38 training error rate. 

Table 6: Classification error of Leukemia data by various methods. 





LPAD FAIR 


NSC 


Naive-LDA GLDA 


Training error 
Testing error 


0/38 1/38 
1/34 1/34 


1/38 
3/34 


1/38 1/38 
7/34 3/34 



6 Discussions 

In this paper we introduced the LPD rule for sparse linear discriminant analysis of high- 
dimensional data. The LPD classifier is based on the direct estimation of the product 
nS through constrained ii minimization which can be implemented efficiently using lin- 
ear programming. The classifier has desirable theoretical and numerical properties and 
performs well in the real data analysis. 

The LPD rule exploits the approximate sparsity of fid which can be estimated more 
efficiently than f2 can. The sparsity of Q6 can be viewed as a relaxation of the conventional 
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assumption on the sparsity of both Q and S. In certain settings, flS can be well estimated 
even when Q, is not estimable consistently which leads to the failure of some conventional 
classification methods. An interesting consequence is that the LPD classifier can still 
perform well even when Q, cannot be estimated well. This is a major advantage of the 
LPD rule over classification methods that are based on separate and good estimates of 
n (or S) and 6. Furthermore, as shown both in the theoretical results (Theorems [2] and 
[3] only require conditions on \Q6\i and not on |f25|o) and in the simulation (In Models 1 
and 2 nearly all of the elements of Q6 are nonzero), only approximate sparsity of Q6 is 
need in order for the LPD rule to perform well. 

In this paper we have focused on the case where the new observation Z has equal prior 
probabilities of belonging to either class 1 or class 2. The procedure can be extended easily 
to the case of unequal prior probabilities vti and 112. In this case, we can define the LPD 
rule by classifying Z to class 1 if and only if 

{Z-iJ,)'^>\og(7^2/n^). 

When the unequal prior probabilities tti and 7r2 are unknown, we can simply estimate them 
by vTi = Ui/n and 7r2 = 7^2 /n respectively. The LPD rule also can be directly extended 
to multi-group classification problems. Suppose there are K classes with distributions 
N{iJ,j^, S) for 1 < A; < K. In the ideal setting where all the parameters are known, the 
oracle rule classifies Z to class k if and only if 

(Z - Hki)'^^ki > for all / ^ k, 

where dki = /x^ — /^^ and /^^^ = (//^ + ^ii)/2. When the parameters are unknown and 
random samples from the distributions are available, the products ^6^1 can then be 
estimated by solving a similar linear programming as in ([3]) and an LPD classifier can be 
constructed accordingly. 
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6.1 Consequence of feature selection on classification 

In the high dimensional setting, it is a common practice to exploit sparsity by first 
selecting a small number of important features (typically by thresholding) and then make 
inference based only on the selected features. A common aspect of these methods is that 
the correlations between the variables are ignored. It should be noted that these methods 
are inefficient in general even when the zero features are known in advance and all the 
important features are selected correctly. The main reason is that those "unimportant" 
features are in fact useful and even potentially important for classification because of the 
correlations. This can be seen as follows by considering the oracle rules in various settings. 

Let us first consider the oracle independence rule which classifies Z into class 1 if and 
only if (Z — fi) D~^d > 0, where D = diag(S). It is easy to see that the misclassification 
rate of the oracle independence rule is 

1 - $(T„), where T„ = — . with D = diag(S). (17) 

It can be shown that the Fisher's rule based only on the important features outperforms 
the oracle independence rule. Write 



(1^ 



where 6i is a A;i-dimensional vector, Sn is ki x ki, S12 is (p — ki) x ki, and S22 is (p — ki) x 
{p — ki). Suppose 62 is known to be 0. Then it follows from ([5]) that the oracle Fisher's 
rule based only on the first ki variables has misclassification rate 1 — <l'((5^I]]~/5i)^/^). 
Note that in this case the oracle independence rule only depends on the important features 
and Tp in (TT7|) can be re-expressed as 

T„ = — / " \ with Dn = diag(Sn). 
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(sA 


and S = 


( ^ ^' \ 


\52) 




S12 S22 , 



It is easy to verify that 



o{2^ii Oi = max — 7— . 



Thus we have ^i^n '^i — '^p? which imphes that the oracle Fisher's rule based only on 
the important features (the first ki variables) outperforms the independence rule. This 
shows that, given only the important features are used for classification, the independence 
rule can be inefficient and correlations among the features should be taken into account. 
Although the oracle Fisher's rule based on the important features is better than the 
independence rule, it is not an efficient rule itself because ignoring the zero (or "unim- 
portant" ) features also leads to inefficiency. Write 5 as in (1181) and suppose the fact that 
(5i 7^ and (52 = is known. We next show that the oracle Fisher's rule based on all the 
features outperforms the Fisher's rule based only on the important features. Note that 
Ap = 6'Q,6 can be decomposed as follows: 

8'n6 = 6'j:^l5i + (^2 - B5i)'w-\52 - BSi), (19) 

where B = S^2^Si2. Note that W = S22 — Si2S^j^'^S]^2 is positive definite. Consequently, 
if BSi 7^ 0, then the last term in ( !T9|) is positive and hence Ap > ^iS^^j^^^i. This means 
that even if the fact that 81 ^ and ^2 = is known in advance, dropping the zero 
features would lead to inefficiency because of the correlations among all the features. 
Therefore classifiers based only on the important features are not efficient in general. 

The above analysis shows that ignoring the correlations and feature selections in gen- 
eral lead to inefficient classifiers. A better alternative is to construct a classification rule 
taking into account of all the features and their correlations. This analysis makes the LPD 
rule even more attractive in the ultra-high dimensional case where p is very large. In such 
a setting estimating the full precision matrix f2 well is very difficult if not impossible. In 
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contrast, it is relatively easy to estimate the vector fid directly. 

7 Proofs 

Throughout this section, we denote by C, Ci, C2,. . . generic constants which may vary 
from place to place. We shall omit the proof of Theorem [1] as it follows directly from 
Theorem [21 see Remark 3 in Section |3l Before proving the other main theorems, we first 
collect some technical lemmas. The following lemma is an exponential inequality from 
Cai and Liu (2011a). The proof also can be found in Cai and Liu (2011b). 

Lemma 1 Let ^i,- ■ ■ ,^n be independent random variables with mean zero. Suppose that 
there exist some t > and B„ such that 



Y.Ee,e'\^^\<Bl 



k=l 

Then uniformly for < x < Bn and n>l, 



n 

P( J]a > CtBnX^ < exp(-x2), (20) 

k=l 

where Ct = t + t~^ . 

The next lemma shows that the true Q,6 belongs to the feasible set of ([3]) with high 
probability. 

Lemma 2 (i). Under (C2) and (C3), we have with probability greater than 1 — 0{p~^), 

\±n^8~{X-Y)\^<\n. (21) 

(ii). Under (C2) and (C4), !f21]) holds with probability greater than 1 — 0{p^^ + n~'^^^). 

Proof of Lemma [2]. We only prove the lemma under (C3). The proof under (C4) 
is similar by using a truncation technique as in Cai, Liu and Luo (2011). The details 
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are given in Cai and Liu (2011b). Write X^ = /j.^^ + Uki and Y^ = /j.2 + Uk2- Set 
X = {Xi, . . . , Xp) and /x^ = (/ii, . . . , fip) . By Lemma [H we have for any M > 0, there 
exists some Ci > such that 



max P(\X, -fi,\> cJ^^^^) < 2p-^ (22) 



and 



p(|(x - ^Ji,)'m\ > Ci J^^^) < 2p-''. (23) 



Similar inequahties hold for Y and /Xg. Note that 

ni n2 



1 / 

n V ■^^ — ' ■^^ — ' ■' J n n 

z=l j=l 

=: S - ^(C/i - fx,){U, - Ml)' - ^(t/2 - /X2)(C/2 - /^2)'- 
Thus by ( 122|) and ( 123|) . it suffices to show that with probability greater than 1 — 0{p~^), 



i^o;t t w ^ ^ /maXiCTiiAplogp 

\i:nd - (/^i - /X2)|oo < c'W ^-^^ . 

For briefness, we set Zi = Un ioi 1 < i < rii and Zi^ni = C^i2 for 1 < i < 77.2. Note that 

tns - (/xi - /X2) = (s - s)ri<5 

which can be further written as 

1 '^ 

-5^(z,z;f2(5-E(z,z;ri5)). 

j=l 
We use Lemma [T]to bound the above partial sums. Write Zj = {Zn, . . . , Zip) . By (C3), 

we have Eexp{tQ\ZijZ^flS\/{ajjd flSY^'^) < Kq for some bounded constants to > and 

Kq > 0. For any constant r > 0, let ^i = {ZijZ[flS-EZijZ[QS)/{ajjd'^Sy/'^) in Lemma 

[Hand B^ = cm with some large constant c depending on r, to, -^o- Then we can get that 

for any r > 0, there exists some constant C2 > depending only on c, r, to and Kq, 



n 

max P ( y^ ^i > C2 \/n\ogp 
^<j<p VI -'^^ I 



i=l 
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< max Pf I V^J > 2-^C2^n\ogp) + max pf I V ^J > 2-^C2\A^ 

l<j<p W ^ — ^ I / 1<.3<P VI ^-^ I 



i<i<p VI -^ I / i<j<p 

1=1 i=ni+l 



r+1 



This implies that for any r > 0, 

PnSf25 - (/Xi - M2)|oo > C2 w^max cTji Ap log p/nj < Ap" 
Lemma [2] is proved. | 

We are now ready to prove Theorems [2ll5l Throughout the proof, we assume (El 



1^ — <5|oo < C^/\ogp/n, \fi — /x|oo < C^y\ogp/n and |S„ — S|oo < C^y\ogp/n for some 
large constant C > 0. The above four inequalities hold with probability greater than 
1 — 0{p^^) or 1 — 0{p~^ + n~^^^) under (C3) or (C4) respectively. 
Proof of Theorems [2] and [5] (i). By the definition of (3, we have 

\{nS)'±n^ - i^S)'d\ < A„|i75|i + 1^ - <5|oo|i^<5|i < 2A„|i7(5|i. (24) 

By ([21]), we have 

|(Od)'S J - 5'p\ < XMi + \d- S\Mi < 2A„|f2(5|i, 

which together with ( 124|) implies that 

\0 - nS)' d\ < 4:Xn\nS\i. (25) 

Thus we have 

< \ifi-fj,)'^\ + 2X^\ns\, 

< C\r-^^\^S\i + 2XJnS\i. (26) 

V n 
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Similarly 



(A - ^2)'/^ - -s'n6\ < cJ^-^^\n6\i + 2Xn\n6\i. 



We next consider the denominator in Rn- We have 



|S^ - (5|oo < |S/9 - Sn/3|oo + 2A„ < C|ri5|i 



Therefore 



logp 



n 



2\n- 



I/3S/3-/3 5I < C|f25|^ ' ^°^^ 



n 



2A„|r25|i. 



By (l25ll . we have 



1/3 S/3 - (512<5| <C|f25|^'^°^^ 



n 



6A„|r2<5|i. 



Suppose that 50,6 > M for some M > 0. By (IHD, (EEl) and (EZD, we have 



(A-Mi)'^ 



(5f2(5 



'/3S/3 V/3S/3 

This inequality implies that 



>c(a;i + o(i; 



-1/2 



> CM^/l 



l^n-^l <exp(-CM). 
Suppose that 5' il5 < M. By dHD and ([27D, we have 



1 



This together with fl26|) yields that 



'/3S/3 V/3S/3 
By flTTI) and some simple calculations 



0(1). 



<cJ^^K 



{STisy/^ 



13 11(3 



\fJiU 



< 



c\n5\lJ'-^ + Q\n5\,K 
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(27) 



(28) 



(29) 



(30) 



< c{s'nd)-^/\\ns\l\f^+\nd\,\r, 

V n 



and 



73S/3 
Note that by (fT2l). (|30l) and (132 



2^^-i(*'n^)"^ 



((511(5)1/2 



<c i»^i; /j2gzH.c4"^i. 



ra 



((5125)1/2 



^n • ' n- 



K = /2 X (l + 0(l)r„(5'f25)i/2exp (o(l)((5'f2(5)i/V„)). 



(31) 



(32) 



(33) 



By the condition ([8]) and the assumption 6 fid < M, we have (|r2(5|i + \fl^\l)^^/logp/n = 
o(l). Thus Rn = (1 + o{l))R. This together with ( 128|) prove the theorems by letting 
n,p — > oo first and then M — ;■ oo. | 



Proof of Theorems [3] and [5] (ii). Under the conditions of Theorem[3]or [5] (ii), we have 
fl29|) -f l32|) hold without assuming 6 fid < M. So Theorems [3] and [5] (ii) follow from ( 133|) 

and the fact (S f2(5)i/^r„ = o(l) immediately. | 

Proof of Theorem |4l We shall prove a better rate for \f3 S/3— (5 flS\ under the condition 
(Uni). We have 

|s(^-f25)U < |s„(^-f25)|oo + l(s„-s)(^-f25)U 



< 2Xn + c\^-ns\ 



\ogp 



n 



< 2X^ + C\n5\o\l^^^\^-flS\ 



n 



< 2Xn + C\\n\\L,\^d\o 



logp 



n 



T,0-fl5)l 



This together with \\fl\\L^fl8\oJ^-^ = o(l) implies that |S(^ - /3)|oo < CA„. Thus we 
have 

l^'s^ - ^'sf2(5| <C\flS\iXn 
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and 

\pli:ns-6'nd\ < c\nd\iXn. 

The remaining steps follow from the proof of (l33l) . | 
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